Abstract - A human listener can easily discriminate between
speech and music by listening to a short segment (i.e., a few
seconds) of an audio signal. In audio signal classification, low
complexity, high accuracy and short delay are crucial for some
practical scenarios. Although much effort has gone into audio
signal classification, the problem of noise-embedded signals has
seldom been addressed, especially in telecommunication
applications, where a real-time and noise-robust approach is
needed. In this paper we review work on classification under
real-world noise with low computation time.

Keywords - Pink noise, Signal to Noise Ratio (SNR), Support
Vector Machine (SVM), White noise, Colored noise.


I. INTRODUCTION

With rapid changes in the telecommunication network
environment, the classification of audio signals is a key
component in many multimedia systems. For instance, most
codecs are designed to handle signals without discrimination
and do not work properly on mixed multimedia signals. Many
available speech/music classifiers give very good
classification accuracy for clean audio input. But when these
systems are used in noisy environments such as military field
areas, highways, traffic signals, process industries and
railway stations, their classification accuracy decreases. So
there should be a classifier that gives good accuracy in these
real-world noisy applications. To design such a noise-robust
classification system, energy, pitch and cepstrum based
features are used. Colored and pink noise signals are used for
noise embedding, and a Support Vector Machine [10, 11] is used
for classification, as it is an efficient and accurate
classifier.


II. MOTIVATION

In a real far-field environment, the input audio signal is not
only speech or music; it may be speech mixed with music, or
some other signal. So the system should classify these four
classes. The surrounding environmental noise is not pure white
Gaussian noise; it is colored noise. Pink noise is a type of
colored noise that closely resembles real audible noise, so
classification should be evaluated on this noise. Further, in
the real world noise is not embedded at a fixed SNR, so
discrimination should be done at a random signal to noise
ratio.

III. PROPOSED WORK

Feature Extraction - For analysis, features are first extracted
from the clean input signal. Different types of noise are then
embedded into the input signal at a random SNR, and the same
feature values are extracted again. The following clip-level
features are extracted and analysed:


Average Pitch Density (APD) - It represents the difference in
tones between speech and music. The real cepstrum is used to
analyse the pitch information, since the cepstrum is a powerful
tool for showing the details of a spectrum by separating pitch
information from the spectral envelope. The real cepstrum is
usually defined as

$$ rc_x(n) = \mathrm{real}\left( \frac{1}{2\pi} \int_{-\pi}^{\pi} \log \left| X_n(j\omega) \right| e^{j\omega m} \, d\omega \right) \tag{1} $$

Here, $X_n(j\omega)$ is the short-time Fourier transform of the
n-th windowed audio frame, n is the frame index, real(·)
denotes extracting the real part of a complex value, and
$rc_x(n)$ is a vector that contains all real cepstral
coefficients of the n-th frame. The low-order coefficients of
$rc_x(n)$ carry the large-scale information of the spectrum,
such as formants, and the high-order coefficients carry the
detail information, such as pitch. The high-order coefficients
are therefore used to distinguish between speech and music.
Since in most telecommunication applications the audio data are
usually disturbed by unpredictable noise, estimating accurate
pitch positions and pitch holding lengths in real time can be
very difficult. So the pitch density (PD) is used to roughly
characterize the pitch properties of music and speech. It is
defined as

$$ PD(n) = \frac{1}{L} \sum_{m=l_1}^{l_2} \left| rc_x(n, m) \right| \tag{2} $$

where $L = l_2 - l_1 + 1$.

Here $rc_x(n, m)$ is the m-th coefficient of $rc_x(n)$, so
PD(n) is the mean of the absolute values of the high-order
coefficients of $rc_x(n)$ with index in $[l_1, l_2]$. Based on
empirical analysis, the average of the overall high-order
cepstrum content is used. For music signals, due to the
characteristics of musical instruments and the existence of
polyphony, the PD tends to be higher than that of speech
signals. To get a more reliable estimate, the average PD (APD)
within an audio segment is used, which is defined as


$$ APD(k) = \frac{1}{N} \sum_{n=k\beta N+1}^{k\beta N+N} PD(n) \tag{3} $$

where N is the number of frames contained in an audio segment,
k is the segment index, and $\beta$ is the overlapping factor
of each segment. Note that APD(k) is a scalar extracted from
the k-th audio segment.
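
The pitch-density computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the reviewed implementation: the naive DFT, the small constant `eps` that guards against log(0), and the function names are all assumptions made for the example, and the index range `[l1, l2]` must be chosen by the user.

```python
import cmath
import math

def real_cepstrum(frame, eps=1e-12):
    """Real cepstrum of one windowed frame, using a naive O(M^2) DFT."""
    M = len(frame)
    # Log-magnitude spectrum; eps (an assumption) avoids log(0).
    log_mag = []
    for k in range(M):
        X_k = sum(frame[m] * cmath.exp(-2j * math.pi * k * m / M) for m in range(M))
        log_mag.append(math.log(abs(X_k) + eps))
    # Inverse DFT of the log-magnitude spectrum, keeping the real part.
    return [
        sum(log_mag[k] * cmath.exp(2j * math.pi * k * q / M) for k in range(M)).real / M
        for q in range(M)
    ]

def pitch_density(cepstrum, l1, l2):
    """Mean absolute value of the coefficients with index in [l1, l2]."""
    L = l2 - l1 + 1
    return sum(abs(c) for c in cepstrum[l1:l2 + 1]) / L

def average_pitch_density(pd_values):
    """Mean of the per-frame PD values of one audio segment."""
    return sum(pd_values) / len(pd_values)
```

For real frames an FFT routine would replace the naive DFT; the quadratic version is used here only to keep the sketch free of external dependencies.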


Relative Tonal Power Density (RTPD) - RTPD focuses on the
distinct properties of percussion instruments and is intended
for noise-like music. First, every audio frame is marked as a
tonal-frame or a non-tonal-frame according to the maximum of
the high-order coefficients of $rc_x(n)$. That is, if the
maximum value is larger than a predefined threshold $\theta$,
indicating a significant peak in the high-order part, the frame
is marked as a tonal-frame. Second, within the current audio
segment, we compute the relative power density ratio of the
tonal-frames to all frames, i.e. RTPD, as


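$$ RTPD(k) = \frac{\mathrm{mean}_{n \in \Theta_k}\left( RMS_x(n)^2 \right)}{\mathrm{mean}_{n \in [k\beta N+1,\; k\beta N+N]}\left( RMS_x(n)^2 \right)} \tag{4} $$

where $\Theta_k$ denotes the set of all tonal-frames inside the
k-th analysis segment and $RMS_x(n)$ is the root mean square of
the n-th windowed audio frame [1]. Note that RTPD(k) is also a
scalar extracted from the k-th short audio segment.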

Voiced speech usually has stronger energy than unvoiced speech
and background noise. So if the RTPD value is small, the audio
signal is probably not speech and may instead be a clip of
noisy music, such as rock music.


Variance of Zero Crossing Rate (varZCR) - It is defined as the
variance of the zero crossing rates over a one-second clip,
where the ZCR is defined as the number of time-domain zero
crossings within a processing window. A zero crossing is said
to occur if successive samples have different algebraic signs.
Thus, the zero-crossing rate [2] is the rate of sign changes
along a signal, i.e., the rate at which the signal changes from
positive to negative or vice versa.

$$ ZCR = \frac{1}{2} \sum_{m=1}^{M} \left| \mathrm{sign}(x(m)) - \mathrm{sign}(x(m-1)) \right| \tag{5} $$

where M is the total number of samples in a processing window
and x(m) is the value of the m-th sample.

High ZCR values correspond to a higher-frequency signal portion
and vice versa.

$$ \mathrm{var}(ZCR) = \frac{1}{N} \sum_{i=1}^{N} \left[ ZCR(i) - \mathrm{avg}(ZCR) \right]^2 \tag{6} $$

where avg(ZCR) is the average value over the clip and ZCR(i)
is the value for the i-th frame.
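
As a rough illustration of the zero-crossing feature, the sketch below counts sign changes between successive samples and then takes the variance of the per-frame counts. Treating a sample value of exactly zero as positive, and using the population variance (division by the number of frames), are assumptions of this example.

```python
def sign(v):
    # Assumption: a sample equal to zero is treated as positive.
    return 1 if v >= 0 else -1

def zero_crossings(frame):
    """Number of sign changes between successive samples of one frame."""
    return 0.5 * sum(abs(sign(frame[m]) - sign(frame[m - 1]))
                     for m in range(1, len(frame)))

def variance(values):
    """Population variance (divides by the number of values)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def var_zcr(frames):
    """Variance of the zero-crossing counts over the frames of a clip."""
    return variance([zero_crossings(f) for f in frames])
```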


Percentage of Low Energy Frame (POLEF) - The Short Time Energy
(STE) is defined as the sum of the squared time-domain data.
This feature can be used to discriminate audio on the basis of
energy. The short time energy of a frame is given as

$$ STE = \sum_{m=0}^{M-1} x^2(m) \tag{7} $$

where M is the total number of samples in a processing window
and x(m) is the value of the m-th sample of the input signal.
POLEF is defined as the percentage of frames whose STE value is
below 0.5 times the average Short Time Energy of a particular
window.


$$ POLEF = \frac{1}{2N} \sum_{n=0}^{N-1} \left[ \mathrm{sign}(0.5\, avSTE - STE(n)) + 1 \right] \tag{8} $$

where N is the total number of frames, n is the frame index,
STE(n) is the Short Time Energy of the n-th frame, avSTE is the
average value of STE over the entire window length and sign( )
is the sign function.
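
A minimal sketch of the low-energy-frame feature follows. It returns the fraction of frames whose STE is below half the average STE (multiply by 100 for a percentage); the function names and the list-of-frames input format are assumptions of this example.

```python
def short_time_energy(frame):
    """Sum of squared samples of one frame."""
    return sum(x * x for x in frame)

def polef(frames):
    """Fraction of frames whose STE is below 0.5 * average STE."""
    ste = [short_time_energy(f) for f in frames]
    av_ste = sum(ste) / len(ste)

    def sgn(v):
        return (v > 0) - (v < 0)

    # sgn(...) + 1 is 2 for a low-energy frame and 0 for a high-energy one,
    # so dividing by 2N yields the fraction of low-energy frames.
    return sum(sgn(0.5 * av_ste - e) + 1 for e in ste) / (2 * len(ste))
```

Speech, with its pauses and unvoiced segments, typically produces a larger share of low-energy frames than music does, which is what makes this ratio usable as a discriminating feature.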

Variance of Spectral Flux (varSF) - It is defined as the
variance of the spectral flux of a clip, where the spectral
flux is the average variation of the spectrum between adjacent
frames over a one-second duration.

$$ SF = \frac{1}{(N-1)(K-1)} \sum_{n=1}^{N-1} \sum_{k=1}^{K-1} \left[ \left| X_n(k) \right| - \left| X_{n-1}(k) \right| \right]^2 \tag{9} $$

where $X_n(k)$ is the Discrete Fourier Transform of the n-th
frame of the input signal, K is the order of the DFT, N is the
total number of frames in a clip and n and n-1 are frame
indices.

$$ varSF = \frac{1}{N} \sum_{i=1}^{N} \left[ SF(i) - \mathrm{avg}(SF) \right]^2 \tag{10} $$

where avg(SF) is the average value over the clip and SF(i) is
the value for the i-th frame.

Spectral flux is a measure of how quickly the power spectrum of
a signal is changing. It is calculated by comparing the power
spectrum of one frame against the power spectrum of the
previous frame, usually as the 2-norm (also known as the
Euclidean distance) between the two normalized spectra.

Variance of RMS (varRMS) - It is defined as the variance of the
root mean square (RMS) value over a one-second clip. The
one-second clip is first buffered into frames of 32 ms at
8 kHz. The root mean square value is then evaluated for each
frame, and the variance is found using the following formulas.

$$ RMS = \sqrt{ \frac{1}{M} \sum_{m=1}^{M} x^2(m) } \tag{11} $$

$$ varRMS = \frac{1}{N} \sum_{i=1}^{N} \left[ RMS(i) - \mathrm{avg}(RMS) \right]^2 \tag{12} $$

where M is the number of samples in a frame, avg(RMS) is the
average value over the clip and RMS(i) is the value for the
i-th frame.
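
The framing step can be made concrete with a short sketch. At 8 kHz a 32 ms frame holds 256 samples, so a one-second clip yields 31 complete frames. Using non-overlapping frames and dropping the trailing partial frame are assumptions of this example.

```python
import math

def split_frames(clip, frame_len):
    """Non-overlapping frames; a trailing partial frame is dropped (assumption)."""
    return [clip[i:i + frame_len]
            for i in range(0, len(clip) - frame_len + 1, frame_len)]

def rms(frame):
    """Root mean square of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def var_rms(clip, sample_rate=8000, frame_ms=32):
    """Variance of the per-frame RMS values of a clip."""
    frame_len = sample_rate * frame_ms // 1000  # 256 samples at 8 kHz
    values = [rms(f) for f in split_frames(clip, frame_len)]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)
```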

Dynamic Range (DR) - It is defined as the ratio between the
largest and smallest possible values of a changeable quantity,
such as the sample value of an audio signal. It is measured as
a ratio, or as a base-10 (decibel) or base-2 (doublings, bits
or stops) logarithmic value. The signal is first normalized,
and then

$$ DR = 20 \left\{ \log_{10}(\max(100\, y)) - \log_{10}(\min(100\, y)) \right\} \tag{13} $$

where y is a one-second audio clip.

Average Mel-Frequency Cepstrum Coefficients (avgMFCC) -
The motivation for using MFCC is that the auditory response of
the human ear resolves frequencies non-linearly. MFCCs are
based on the known variation of the human ear's critical
bandwidths with frequency; filters spaced linearly at low
frequencies and logarithmically at high frequencies are used to
capture the phonetically important characteristics of speech.
Fig. 1 Steps in MFCC computation

The computation of the Mel-frequency cepstrum is similar to
that of the cepstral coefficients. The difference lies in the
mel-frequency warping applied before the logarithm and the
inverse DFT.

The MFCCs are averaged over all 31 frames to obtain 12 MFCC
coefficients for a clip. The average of these 12 coefficients
then gives a single value for the linear SVM input.


Average Delta Cepstral Energy (avgDCE) - The delta
cepstrum measures the temporal change in audio
characteristics and can be used to track energy change in
speech or music over time. The energy variation can be
observed by analyzing the sum of the squares of the delta
cepstral coefficients for each frame. This sum of squares of
the delta coefficients is termed the Delta Cepstral Energy
(DCE). The DCE is computed using

$$ E_{d_i} = \sum_{j=1}^{N} d_{ij}^2 \tag{18} $$

where $d_{ij}$ is the j-th delta cepstral coefficient of the
i-th frame, N is the number of delta coefficients and
$E_{d_i}$ is the DCE for the i-th frame. The delta MFCC
coefficients are computed as

$$ d_t = \frac{\sum_{k=1}^{N} k \left( c_{t+k} - c_{t-k} \right)}{2 \sum_{k=1}^{N} k^2} \tag{19} $$

where N here represents the delta window size, $c_t$ represents
the MFCC at frame t and $d_t$ is the delta coefficient for
frame t. Finally, the average of these 29 DCE values gives a
single value for the linear SVM input.


Mean of Minimum Cepstral Distance (MMCD) - The MMCD parameter
is based on cepstral distances computed from Mel-frequency
cepstral coefficients (MFCC). It finds the minimum cepstral
distance among neighboring frames.

Since the low-order coefficients of the cepstrum represent the
spectral envelope, the cepstral distance between two frames is
a parameter that measures the difference between them. The
cepstral distance between the n-th and (n+d)-th frames is
defined as follows:

$$ CD(n, d) = \sum_{k=1}^{K} \left( c(n+d, k) - c(n, k) \right)^2 \tag{14} $$

where K is the order of the cepstrum, c(n, k) is the k-th
cepstral coefficient of the n-th frame and d is the frame
interval between the two frames being compared.

The mean of the cepstral distances is defined as follows, with
N the number of frames:

$$ M_{CD}(d) = \frac{1}{N-d} \sum_{n=1}^{N-d} CD(n, d) \tag{15} $$

The modified cepstral distance (MCD) is then given by

$$ MCD(n, d_1, d_2) = \min_{d_1 \le d \le d_2} CD(n, d) \tag{16} $$

where $d_1$ and $d_2$ indicate the range of candidate frames to
be searched for the minimum cepstral distance. The MMCD, the
mean of the MCD, is then computed as follows:

$$ MMCD(d_1, d_2) = \frac{1}{N-d_1} \sum_{n=0}^{N-d_1} MCD(n, d_1, d_2) \tag{17} $$


Average Power Spectrum Deviation (avgPSDev) - Speech has
greater energy at low frequencies, whereas in music the higher
frequencies also carry significant energy. Thus, the energy in
each filter of a filter bank can also be used for speech and
music discrimination. Power Spectrum Deviation (PSDev) is
computed as the standard deviation of the filter bank energies
in each frame. Thus, PSDev can be found using

$$ P_i = \sqrt{ \frac{1}{N-1} \sum_{j=1}^{N} \left( E_{ij} - avgE_i \right)^2 } \tag{20} $$

where $P_i$ is the PSDev for the i-th frame, N is the number of
filters in the filter bank, $E_{ij}$ is the energy in the j-th
filter of the i-th frame, and $avgE_i$ is the mean energy over
all filters in the i-th frame, computed using

$$ avgE_i = \frac{1}{N} \sum_{j=1}^{N} E_{ij} \tag{21} $$

Finally, the average of these 29 PSDev values gives a single
value for the linear SVM input.


Noise - In communication systems, white noise is generally used
to represent noise. For real-world noise, however, white noise
is not the correct choice, since true white noise has infinite
bandwidth and therefore infinite power. Colored noise is used
to represent real-world noise, as it has finite power over a
limited bandwidth. Pink noise is used to represent physical
(audible) noise, as it has constant energy per constant
percentage bandwidth. White, colored and pink noise are
compared in Figure 2.

Figure 2: Waveforms of white, colored and pink noise

Colored noise - Colored noise is a mixture of all types of
environmental noise, such as pink, red and gray noise. White
noise has constant power spectral density across the entire
frequency spectrum (extending up to infinity). There is no
correlation between the samples of a white noise process at
different time instants, i.e. the autocorrelation or
autocovariance of white noise is zero for all lags except lag
L = 0, as Figure 3 shows. For colored noise, the power spectral
density is not uniform across the frequency spectrum, and the
autocorrelation or autocovariance takes non-zero values at
non-zero lags. The autocovariance is maximum at zero lag
(L = 0) and decreases gradually as the lag L increases or
decreases.



Figure 3: Normalized auto-covariance of white and colored noise 


The frequency spectra of white and colored noise are shown in
Figure 4. For white noise the power spectrum is constant over
all frequency bands, whereas for colored noise it is maximum at
0 Hz and decreases gradually as frequency increases.

Figure 4: Frequency spectrum of white and colored noise


Colored noise is generated by passing white noise through a
shaping filter. The shaping filter is a dynamic filter, usually
a low-pass filter. The response of the colored noise can be
varied by adjusting the parameters of the shaping filter.

% z: input audio clip (row vector); variance: white-noise variance
% a: filter parameter, pole of the one-pole low-pass filter (0 < a < 1)
whiteNoise = sqrt(variance) * randn(1, length(z));
coloredNoise = filter(1-a, [1 -a], whiteNoise);
Colorednoise_embedded_signal = z + coloredNoise;

where 'z' is the input audio clip, 'a' is the filter parameter,
and the white noise has zero mean and the given variance.

Pink noise - Pink noise is a specific type of colored noise
whose power spectrum is inversely proportional to frequency. As
Figure 5 shows, white noise has a constant power spectrum
irrespective of frequency, whereas the power spectrum of pink
noise falls off inversely with frequency. Pink noise has
uniform power density per relative bandwidth (octave, decade),
i.e. constant energy per constant percentage bandwidth. This
equates to a -3 dB/octave frequency response.

Figure 5: Power spectrum of white and pink noise

Pink noise is generated by passing white noise through a
-3 dB per octave filter. The filter coefficients 'A' and 'B'
below are specific values that generate pink noise within a
+/-0.05 dB tolerance [17].

B = [0.049922035 -0.095993537 0.050612699 -0.004408786];
A = [1 -2.494956002 2.017265875 -0.522189400];
pinkNoise = filter(B, A, whiteNoise);   % shape white noise to 1/f
pinknoise_embeddedsignal = z + pinkNoise;

where 'z' is the input audio signal.
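
The snippets above add the noise at whatever level the filter produces. Since the motivation calls for embedding noise at a random SNR, the following Python sketch shows one way this could be done, using the same pink-noise filter coefficients. The helper names, the 440 Hz sine used as a stand-in for the audio clip, the fixed random seed and the 0 to 20 dB SNR range are all assumptions made for illustration.

```python
import math
import random

# Pink-noise filter coefficients quoted in the text.
B = [0.049922035, -0.095993537, 0.050612699, -0.004408786]
A = [1.0, -2.494956002, 2.017265875, -0.522189400]

def iir_filter(b, a, x):
    """Direct-form difference equation with zero initial conditions."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc / a[0])
    return y

def power(x):
    """Mean squared value of a sequence."""
    return sum(v * v for v in x) / len(x)

def embed_noise(signal, noise, snr_db):
    """Scale noise so that 10*log10(P_signal / P_noise) equals snr_db, then add it."""
    gain = math.sqrt(power(signal) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * v for s, v in zip(signal, noise)]

rng = random.Random(0)                        # fixed seed, for repeatability only
z = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]  # stand-in clip
white = [rng.gauss(0.0, 1.0) for _ in z]      # zero-mean, unit-variance white noise
pink = iir_filter(B, A, white)
snr_db = rng.uniform(0.0, 20.0)               # hypothetical SNR range in dB
noisy = embed_noise(z, pink, snr_db)
```

Scaling the noise rather than the signal keeps the clip level unchanged, so features computed on the clean and noisy versions remain directly comparable.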

Classifier - A Support Vector Machine (SVM) is used as the
classifier because of its low computational complexity and high
classification accuracy.

Support vector machines use supervised learning for
classification. SVMs map input vectors to a higher dimensional
space if the data are not linearly separable. A hyper plane is
then constructed to separate the input vectors. Two parallel
hyper planes are constructed, one on each side of the
separating hyper plane, and the separating hyper plane that
maximizes the distance between these two parallel hyper planes
is taken as the solution. In linearly non-separable cases, a
kernel function is required to transform the original feature
space implicitly into a higher dimensional space in which the
mapped data are linearly separable. Common kernels include the
polynomial, Gaussian Radial Basis Function and sigmoid kernels.
The choice of kernel is an important issue in SVM
classification.

Optimal Hyper Plane for Linearly Separable Patterns:
Consider the training sample $\{(x_i, d_i)\}_{i=1}^{N}$, where
$x_i$ is the input pattern for the i-th example and $d_i$ is
the corresponding desired response (target output). Let us
assume that the pattern (class) represented by the subset
$d_i = +1$ and the pattern represented by the subset
$d_i = -1$ are linearly separable. The equation of a decision
surface in the form of a hyper plane that does the separation
is

$$ w^T x + b = 0 \tag{22} $$

where x is the input vector, w is an adjustable weight vector
and b is the bias.

Thus,

$$ w^T x_i + b \geq 0 \quad \text{for } d_i = +1 \tag{23} $$
$$ w^T x_i + b < 0 \quad \text{for } d_i = -1 \tag{24} $$



Figure 6: Geometric construction of optimum hyper plane for 
two dimensional input space 


Hyper Plane for Non-Separable Patterns: To set the stage for a
formal treatment of non-separable data points, a new set of
non-negative scalar variables $\{\xi_i\}_{i=1}^{N}$ is
introduced in the definition of the separating hyper plane. The
$\xi_i$ are called slack variables; they measure the deviation
of a data point from the ideal condition of pattern
separability.

$$ d_i (w^T x_i + b) \geq 1 - \xi_i \quad \text{for } i = 1, 2, \ldots, N \tag{25} $$

The support vectors are those particular data points that
satisfy equation (25) precisely, even if $\xi_i > 0$. The
primal problem for the non-separable case may thus be formally
defined as follows, where C is a user-specified positive
parameter, also called the SVM penalty/cost parameter.



Figure 7: Non-Separable Training Sets introduces misclassification 

The weight vector w minimizes the cost function

$$ \Phi(w) = \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i \tag{26} $$


The parameter C controls the tradeoff between the complexity of
the machine and the number of non-separable points; it may
therefore be viewed as a regularization parameter. The
parameter C has to be selected by the user, in one of two ways:
it can be determined experimentally via the standard use of a
training/validation (test) set, which is a crude form of
resampling, or it can be determined analytically.

For patterns that are not linearly separable, the following 
mathematical operations are performed in construction of 
SVM optimal hyper plane. 

1. Nonlinear mapping of the input vector into a
high-dimensional feature space that is hidden from
both input and output. The low-dimensional input
data x is mapped into a high-dimensional feature
space by a mapping function $\varphi(x)$.

2. Construction of an optimum hyper plane that
separates the features discovered in step 1.



Figure 8: Nonlinear mapping from the input space to a higher dimensional
feature space

The separating hyper plane is now defined as a linear function 
of vectors drawn from the feature space rather than original 
input space. 

Inner product kernel: The term $\varphi^T(x_i)\varphi(x)$
represents the inner product of two vectors induced in the
feature space by the input vector x and the input pattern
$x_i$ pertaining to the i-th example. The inner product kernel
is denoted by $K(x, x_i)$ and defined by

$$ K(x, x_i) = \varphi^T(x)\, \varphi(x_i) \quad \text{for } i = 1, 2, \ldots, N \tag{27} $$

In the multiclass SVM, the Radial Basis Function (RBF) kernel
is used. Because the RBF kernel nonlinearly maps samples into a
higher dimensional space, it can, unlike the linear kernel,
handle the case where the relation between class labels and
attributes is nonlinear. The RBF kernel also has few
hyperparameters, which keeps the complexity of model selection
low. Its equation is given as

$$ K(x, x_i) = e^{-\gamma \| x - x_i \|^2} \quad \text{for } i = 1, 2, \ldots, N \tag{28} $$

The width $\gamma$ is a kernel parameter specified a priori by
the user ($\gamma > 0$).
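
As a small worked example of the RBF kernel, the sketch below evaluates it for two feature vectors given as plain Python lists. The function name and the explicit check on the kernel parameter are assumptions of this illustration.

```python
import math

def rbf_kernel(x, x_i, gamma):
    """RBF kernel value exp(-gamma * ||x - x_i||^2) for two equal-length vectors."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_i))
    return math.exp(-gamma * sq_dist)
```

The kernel equals 1 for identical vectors and decays toward 0 as the vectors move apart; a larger gamma makes the decay faster, so each training sample influences a smaller neighborhood.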

Cross-validation and Grid-search: There are two parameters when
using RBF kernels: C and $\gamma$. It is not known beforehand
which C and $\gamma$ values are best for a given problem;
consequently some kind of model selection (parameter search)
must be done. The goal is to identify a good (C, $\gamma$) so
that the classifier can accurately predict unknown data (i.e.,
testing data). Note that it may not be useful to achieve high
training accuracy (i.e., a classifier that accurately predicts
training data whose class labels are already known). Therefore,
a common approach is to separate the training data into two
parts, one of which is treated as unknown when training the
classifier. The prediction accuracy on this held-out part then
reflects more precisely the performance on classifying unknown
data. An improved version of this procedure is
cross-validation.

In v-fold cross-validation, the training set is first divided
into v subsets of equal size. Each subset in turn is tested
using the classifier trained on the remaining v-1 subsets.
Thus each instance of the whole training set is predicted once,
and the cross-validation accuracy is the percentage of data
that are correctly classified. The cross-validation procedure
can prevent the overfitting problem. It is recommended to use a
"grid-search" on C and $\gamma$ using cross-validation.
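
The v-fold splitting step can be sketched as follows. This is an illustrative helper, not part of any particular SVM package: the function name and the interleaved assignment of samples to folds are assumptions, and fold sizes differ by at most one when the number of samples is not a multiple of v.

```python
def v_fold_splits(n_samples, v):
    """Yield (train_indices, test_indices) for each of the v folds."""
    # Interleaved assignment: sample i goes to fold i mod v.
    folds = [list(range(i, n_samples, v)) for i in range(v)]
    for i, test in enumerate(folds):
        train = sorted(idx for j, fold in enumerate(folds) if j != i
                       for idx in fold)
        yield train, test
```

Every sample lands in exactly one test fold, which is why each instance of the training set is predicted exactly once.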

"Grid.py" is a program which performs a "grid-search" on C and
$\gamma$ using cross-validation. Basically, pairs of
(C, $\gamma$) are tried and the one with the best
cross-validation accuracy is picked. Trying exponentially
growing sequences of C and $\gamma$ is a practical method to
identify good parameters (for example, C = 2^-5, 2^-3, ...,
2^15 and $\gamma$ = 2^-15, 2^-13, ..., 2^3). The grid-search is
a straightforward approach to determine the optimum values of C
and $\gamma$.

There are three reasons for preferring the grid-search approach
over other methods:

1. It does an exhaustive parameter search rather than
relying on approximations or heuristics.

2. The computational time to find good parameters by
grid-search is comparable to that of advanced
methods, since there are only two parameters to be
determined.

3. Unlike advanced iterative processes, grid-search
can be easily parallelized because each (C, $\gamma$)
pair is independent.
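
The grid-search loop itself is short, as this sketch shows. It is not the "Grid.py" program: the function names are assumptions, and `toy_score` is a hypothetical stand-in for the cross-validation accuracy, with its peak placed at an arbitrary grid point so the example runs without a classifier.

```python
import math
from itertools import product

def exponent_grid(start, stop, step):
    """Powers of two 2^start, 2^(start+step), ..., up to 2^stop."""
    return [2.0 ** e for e in range(start, stop + 1, step)]

def grid_search(score, c_values, gamma_values):
    """Return (best_score, C, gamma) over all (C, gamma) pairs."""
    best = None
    for C, gamma in product(c_values, gamma_values):
        s = score(C, gamma)  # each pair is evaluated independently
        if best is None or s > best[0]:
            best = (s, C, gamma)
    return best

C_GRID = exponent_grid(-5, 15, 2)       # 2^-5, 2^-3, ..., 2^15
GAMMA_GRID = exponent_grid(-15, 3, 2)   # 2^-15, 2^-13, ..., 2^3

def toy_score(C, gamma):
    # Hypothetical placeholder for cross-validation accuracy,
    # peaked at C = 2^3 and gamma = 2^-7 purely for illustration.
    return -((math.log2(C) - 3) ** 2 + (math.log2(gamma) + 7) ** 2)

best_score, best_C, best_gamma = grid_search(toy_score, C_GRID, GAMMA_GRID)
```

With these ranges the grid has 11 x 10 = 110 pairs. Because no evaluation depends on another, the loop body can be distributed across workers without changing the result.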

Figure 9: Plot of log2(C) vs log2(gamma) from the grid search
(best point annotated at log2(C) = 3, log2(gamma) = -7)

IV. CONCLUSION AND FUTURE SCOPE 

We reviewed a noise-robust four-class classification system
with an accuracy of 96.5% for clean audio input and 95% for
pink-noise-embedded audio input. Because the system deals with
real-world noise (colored and pink noise), it is applicable to
real-world applications. Its computation time is 0.96 second,
which is less than the clip duration (1 second), so the system
is also useful for real-time applications. Future work will
focus on optimizing the number of features used for
classification.